Opening Questions

What influences happiness?

Happiness over time?

Datasets and pre-processing

Our base dataset is the World Happiness Report for the years 2015 to 2022. The World Happiness Report is a publication of the United Nations Sustainable Development Solutions Network. It contains articles and rankings of national happiness, based on respondent ratings of their own lives, which the report also correlates with various life factors. The dataset contains 12 columns and 1237 rows; its first rows are shown in the table below. As our goal is to analyse the factors of happiness, the following six columns are the most important for us: GDP, Family, Health, Freedom, Corruption and Generosity.
Country      Region          Happiness.Rank  Happiness.Score  Standard.Error  Economy..GDP.per.Capita.  Family   Health..Life.Expectancy.  Freedom  Trust..Government.Corruption.  Generosity  Dystopia.Residual
Switzerland  Western Europe  1               7.587            0.03411         1.39651                   1.34951  0.94143                   0.66557  0.41978                        0.29678     2.51738
Iceland      Western Europe  2               7.561            0.04884         1.30232                   1.40223  0.94784                   0.62877  0.14145                        0.43630     2.70201
Denmark      Western Europe  3               7.527            0.03328         1.32548                   1.36058  0.87464                   0.64938  0.48357                        0.34139     2.49204

In addition, we wanted to include further factors and therefore added the following three datasets:

Merging these datasets gives us four additional factors: Alcohol, Population, Tobacco and Internet.

To join the different datasets we had to do some manual preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN values) and joining the datasets on the year and the country code.
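The join on year and country code can be sketched in base R as follows. This is only an illustration: the two data frames hold made-up toy values, and the key column names `Code` and `Year` are assumptions modelled on the tables in this report, not the actual preprocessing script.

```r
# Toy stand-ins for the happiness data and one additional dataset
# (all values are made up for illustration).
happiness <- data.frame(
  Code      = c("FIN", "NOR", "DNK", "FIN"),
  Year      = c(2018, 2018, 2018, 2017),
  Happiness = c(7.6, 7.5, 7.4, 7.3)
)
alcohol <- data.frame(
  Code    = c("FIN", "NOR", "DNK"),
  Year    = c(2018, 2018, 2018),
  Alcohol = c(10.5, 7.5, 10.0)
)

# Left join: keep every happiness row; rows without a match get NA.
joined <- merge(happiness, alcohol, by = c("Code", "Year"), all.x = TRUE)
print(joined)
```

A left join (`all.x = TRUE`) keeps country-years that have no match in the additional dataset, which is what makes the missing values visible afterwards.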

After joining, we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to create two datasets: one for analysing the change in happiness over time and one for analysing the factors that influence happiness in a single year.

For the first dataset, used for the over-time analysis, we only included the six factors from the base happiness dataset and excluded all rows containing missing values. We also renamed the columns to have shorter labels.
Country      Happiness.Rank  Happiness  Economy  Family   Health   Freedom  Trust    Generosity  Year  Region
Switzerland  1               7.587      1.39651  1.34951  0.94143  0.66557  0.41978  0.29678     2015  Western Europe
Iceland      2               7.561      1.30232  1.40223  0.94784  0.62877  0.14145  0.43630     2015  Western Europe
Denmark      3               7.527      1.32548  1.36058  0.87464  0.64938  0.48357  0.34139     2015  Western Europe

For the second dataset, used for the influential-factors analysis, we inspected the missing values for each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). By comparison, figure “missing values 2017” shows that the smoking and alcohol datasets contain no values at all for 2017. We then again excluded all rows containing missing values and renamed the columns to have shorter labels.

Country  Happiness.Rank  Happiness  Economy  Family  Health  Freedom  Trust  Generosity  Year  Region          Country.Code  Code  Alcohol  Population  Tobacco  Internet
Finland  1               7.632      1.305    1.592   0.874   0.681    0.393  0.202       2018  Western Europe  FI            FIN   10.78    5522585     19.7     88.88996
Norway   2               7.594      1.456    1.582   0.861   0.686    0.340  0.286       2018  Western Europe  NO            NOR   7.41     5337960     13.0     96.49166
Denmark  3               7.555      1.351    1.590   0.868   0.683    0.408  0.284       2018  Western Europe  DK            DNK   10.26    5752131     18.6     97.31920
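Choosing the year with the fewest missing values and then dropping incomplete rows could look roughly like the sketch below. The data frame `full_data` and all of its values are hypothetical; only the idea (count the missing values per year, pick the minimum, keep complete rows) follows the description above.

```r
# Toy joined data (values are made up): 2017 has no Alcohol or Tobacco values.
full_data <- data.frame(
  Year      = c(2017, 2017, 2018, 2018, 2018),
  Happiness = c(7.5, 7.4, 7.6, 7.5, 7.4),
  Alcohol   = c(NA, NA, 10.5, NA, 10.0),
  Tobacco   = c(NA, NA, 19.5, 13.0, 18.5)
)

# Count missing cells per row, then sum them up per year.
na_per_row  <- rowSums(is.na(full_data))
na_per_year <- tapply(na_per_row, full_data$Year, sum)
print(na_per_year)

# Pick the year with the fewest missing values and keep complete rows only.
best_year <- as.numeric(names(which.min(na_per_year)))
data_year <- na.omit(full_data[full_data$Year == best_year, ])
print(data_year)
```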

missing values full data

missing values 2017

missing values 2018

Preliminary analyses

One of the objectives of preliminary data analysis is to get a feel for the data by describing its key features and summarising the results. We focus on the second dataset, the influential-factors dataset, as it contains the most explanatory variables.

Boxplots, scale data?

First we use the summary to check how the explanatory variables are distributed. As we can see, they are on different scales, especially “Population” and “Internet usage”. Because we do not want the following analyses to be dominated by the variables with the largest distances, we scale each variable by \(\frac{x - \operatorname{mean}(x)}{\operatorname{sd}(x)}\).

##    Happiness        Economy           Family          Health      
##  Min.   :2.905   Min.   :0.0760   Min.   :0.372   Min.   :0.0000  
##  1st Qu.:4.486   1st Qu.:0.7040   1st Qu.:1.063   1st Qu.:0.4475  
##  Median :5.483   Median :1.0100   Median :1.314   Median :0.6750  
##  Mean   :5.489   Mean   :0.9335   Mean   :1.247   Mean   :0.6283  
##  3rd Qu.:6.332   3rd Qu.:1.2240   3rd Qu.:1.481   3rd Qu.:0.8180  
##  Max.   :7.632   Max.   :1.5760   Max.   :1.644   Max.   :1.0080  
##                                                                   
##     Freedom           Trust          Generosity        Alcohol      
##  Min.   :0.0250   Min.   :0.0000   Min.   :0.0000   Min.   : 0.003  
##  1st Qu.:0.3875   1st Qu.:0.0500   1st Qu.:0.1020   1st Qu.: 3.220  
##  Median :0.5040   Median :0.0880   Median :0.1670   Median : 7.150  
##  Mean   :0.4758   Mean   :0.1195   Mean   :0.1840   Mean   : 6.842  
##  3rd Qu.:0.5835   3rd Qu.:0.1450   3rd Qu.:0.2545   3rd Qu.:10.385  
##  Max.   :0.7240   Max.   :0.4570   Max.   :0.5980   Max.   :15.090  
##                                                                     
##    Population           Tobacco         Internet    
##  Min.   :3.367e+05   Min.   : 3.70   Min.   : 4.10  
##  1st Qu.:5.488e+06   1st Qu.:13.90   1st Qu.:37.60  
##  Median :1.444e+07   Median :22.20   Median :68.21  
##  Mean   :6.007e+07   Mean   :22.02   Mean   :60.43  
##  3rd Qu.:4.430e+07   3rd Qu.:27.90   3rd Qu.:82.81  
##  Max.   :1.428e+09   Max.   :45.50   Max.   :99.60  
##                                                     
##                                 Region  
##  Sub-Saharan Africa                :27  
##  Western Europe                    :20  
##  Latin America and Caribbean       :14  
##  Central and Eastern Europe        :13  
##  Middle East and North Africa      :12  
##  Commonwealth of Independent States: 9  
##  (Other)                           :20

We can see that every factor is now on the same scale. We have some outliers for Family, Freedom, Trust, Generosity and Population.
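The scaling step can be reproduced with base R's `scale()`, which subtracts the column mean and divides by the column standard deviation. The toy data frame below is made up and only mimics two variables on very different scales.

```r
# Two toy variables on very different scales (values are made up).
toy <- data.frame(
  Population = c(3.4e5, 5.5e6, 1.4e7, 1.4e9),
  Internet   = c(4.1, 37.6, 68.2, 99.6)
)

# Manual z-score for one column: (x - mean(x)) / sd(x)
z_manual <- (toy$Population - mean(toy$Population)) / sd(toy$Population)

# scale() applies the same transformation to every column.
toy_scaled <- as.data.frame(scale(toy))

print(round(colMeans(toy_scaled), 10))  # column means are 0 up to rounding
print(apply(toy_scaled, 2, sd))         # column standard deviations are 1
```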

correlation matrix

On the correlation matrix plot we see that Happiness has the strongest correlations with Economy (0.833) and Internet (0.817). Among the correlations between the explanatory variables, the following stand out:

  • 0.934: Economy and Internet
  • 0.875: Economy and Health
  • 0.864: Internet and Health
  • 0.741: Family and Economy
  • 0.728: Internet and Family
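A list of strongly correlated pairs like the one above can be extracted from a correlation matrix programmatically. The following is a sketch on simulated data; the variable names only echo the report, and the threshold of 0.7 is an arbitrary choice made for this example.

```r
set.seed(1)
n <- 100
economy  <- rnorm(n)
internet <- economy + rnorm(n, sd = 0.3)  # constructed to track economy closely
tobacco  <- rnorm(n)                      # unrelated noise
toy <- data.frame(Economy = economy, Internet = internet, Tobacco = tobacco)

corr <- cor(toy)  # Pearson correlation by default
print(round(corr, 3))

# List each variable pair once (upper triangle) if |r| exceeds the threshold.
threshold <- 0.7  # arbitrary cut-off chosen for this sketch
high <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)
```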

regression

One tool for getting a first impression of what influences happiness is linear regression. For the regression we use the unscaled data. If our linear model predicts well, we can interpret its coefficients as describing how each variable influences the outcome. This is also called regression analysis, where the goal is to isolate the relationship between each explanatory variable and the outcome variable.

However, this interpretation assumes that we can change the value of one explanatory variable while holding the others fixed. This is only realistic if the explanatory variables are not correlated with each other. If this independence does not hold, we have a problem of multicollinearity: the coefficients can swing wildly depending on which other explanatory variables are in the model. The coefficients therefore become very sensitive to small changes in the model and cannot be easily interpreted.

One way to assess how strongly the explanatory variables are affected by multicollinearity is the variance inflation factor (VIF). The VIF measures how strongly each explanatory variable is correlated with the others. VIFs between 1 and 5 suggest only a small correlation; VIFs greater than 5 represent critical levels of multicollinearity, where the coefficients are poorly estimated.
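The VIF of a predictor can be computed by regressing it on all other predictors and taking \(1 / (1 - R^2)\) of that auxiliary regression. The sketch below does this in base R on simulated data; the helper `vif_manual()` is a hypothetical name introduced here, and the report does not state which function was actually used (`vif()` from the `car` package is a common choice).

```r
set.seed(42)
n <- 200
economy  <- rnorm(n)
internet <- economy + rnorm(n, sd = 0.3)  # nearly collinear with economy
tobacco  <- rnorm(n)                      # independent predictor
predictors <- data.frame(Economy = economy, Internet = internet, Tobacco = tobacco)

# VIF of one predictor: regress it on all the others and use 1 / (1 - R^2).
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

print(round(vif_manual(predictors), 2))
```

In this simulation the two collinear predictors end up well above the threshold of 5, while the independent one stays close to 1.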

If we build a linear regression model on all explanatory variables, we get an R-squared of 0.8303. However, by plotting the VIF values we can see that a model based on all explanatory variables has severe multicollinearity. Therefore we cannot interpret the coefficients for Internet and Economy.

## 
## Call:
## lm(formula = Happiness ~ ., data = not_scaled_data_factors)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52814 -0.29381  0.02782  0.30897  1.30352 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.139e+00  2.513e-01   8.510 1.39e-13 ***
## Economy      8.474e-01  4.030e-01   2.103 0.037923 *  
## Family       1.004e+00  2.732e-01   3.677 0.000376 ***
## Health       8.986e-01  4.190e-01   2.144 0.034323 *  
## Freedom      7.349e-01  4.203e-01   1.749 0.083302 .  
## Trust        5.337e-01  5.615e-01   0.951 0.343986    
## Generosity   1.129e+00  5.072e-01   2.225 0.028256 *  
## Alcohol      3.129e-03  1.352e-02   0.231 0.817491    
## Population   6.294e-11  2.692e-10   0.234 0.815587    
## Tobacco     -2.120e-02  5.560e-03  -3.812 0.000234 ***
## Internet     9.302e-03  5.284e-03   1.760 0.081276 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4937 on 104 degrees of freedom
## Multiple R-squared:  0.8303, Adjusted R-squared:  0.814 
## F-statistic: 50.89 on 10 and 104 DF,  p-value: < 2.2e-16

If we build a linear regression model without Internet and Economy, we get an R-squared of 0.7924. This R-squared is lower than before, but the VIF plot shows that we can now interpret the coefficients of the remaining explanatory variables.

Interestingly, only Family, Health and Tobacco are statistically significant:

  • Family has a positive effect on the Happiness Score. A one-unit increase in Family results in an absolute Happiness increase of 1.598
  • Health has a positive effect on the Happiness Score. A one-unit increase in Health results in an absolute Happiness increase of 2.404
  • Tobacco has a negative effect on the Happiness Score. A one-unit increase in Tobacco results in an absolute Happiness decrease of 0.01884
## 
## Call:
## lm(formula = Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5823 -0.3292  0.0333  0.3509  1.3621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.836e+00  2.666e-01   6.886 4.21e-10 ***
## Family       1.598e+00  2.671e-01   5.984 3.00e-08 ***
## Health       2.404e+00  3.057e-01   7.864 3.31e-12 ***
## Freedom      7.007e-01  4.587e-01   1.528  0.12961    
## Trust        9.984e-01  6.059e-01   1.648  0.10237    
## Generosity   6.073e-01  5.377e-01   1.129  0.26124    
## Alcohol      1.741e-04  1.479e-02   0.012  0.99063    
## Population  -2.081e-11  2.842e-10  -0.073  0.94177    
## Tobacco     -1.884e-02  6.036e-03  -3.122  0.00231 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5409 on 106 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7767 
## F-statistic: 50.58 on 8 and 106 DF,  p-value: < 2.2e-16
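Reading a coefficient as the change in predicted Happiness for a one-unit increase in one variable, with the others held fixed, can be checked directly on a fitted model. The sketch below uses simulated data with made-up effect sizes, not the report's dataset.

```r
set.seed(3)
n <- 50
toy <- data.frame(Family = runif(n, 0.4, 1.6), Tobacco = runif(n, 4, 45))
toy$Happiness <- 2 + 1.5 * toy$Family - 0.02 * toy$Tobacco + rnorm(n, sd = 0.3)

fit <- lm(Happiness ~ Family + Tobacco, data = toy)

# Predict for a baseline observation and for the same one with Family + 1.
base   <- data.frame(Family = 1.0, Tobacco = 20)
bumped <- data.frame(Family = 2.0, Tobacco = 20)
diff_pred <- predict(fit, bumped) - predict(fit, base)

print(unname(diff_pred))             # change in predicted Happiness
print(unname(coef(fit)["Family"]))   # the Family coefficient, same value
```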

Next we tried linear regression methods with shrinkage. For Lasso and Ridge regression all predictor variables should be scaled so that they have the same standard deviation. Otherwise, the predictors are weighted unequally in the penalty term, depending on their scale. The glmnet() function, however, standardises the predictors by default, and the output coefficients are converted back to the original scale.
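Why the scale matters for the penalty can be illustrated with the closed-form ridge solution in base R. This is a simplified sketch on simulated data, not the algorithm glmnet uses; the helper `ridge_coef()` and the penalty value are assumptions made for this example, and glmnet scales its penalty differently, so the value of lambda is not comparable.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)             # predictor on a small scale
x2 <- rnorm(n, sd = 1000)  # predictor on a much larger scale
y  <- 2 * x1 + 0.002 * x2 + rnorm(n)
X  <- cbind(x1 = x1, x2 = x2)

# Closed-form ridge on centred data: beta = (X'X + lambda * I)^-1 X'y
ridge_coef <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
}

lambda <- 50  # arbitrary penalty chosen for this sketch

# Without standardising, the penalty mostly hits the small-scale predictor.
raw <- ridge_coef(X, y, lambda)

# Standardise, fit, then convert the coefficients back to the original scale.
sds <- apply(X, 2, sd)
std <- ridge_coef(scale(X), y, lambda) / sds

ols <- coef(lm(y ~ x1 + x2))[-1]
print(rbind(ols = ols, ridge_raw = raw, ridge_standardised = std))
```

On the raw data only the coefficient of `x1` is shrunk noticeably, whereas after standardisation both coefficients are shrunk by a similar proportion.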

The Ridge regression

library(glmnet)
# Predictor matrix without Internet and Economy; [, -1] drops the intercept column
x <- model.matrix(Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)[, -1]
y <- not_scaled_data_factors$Happiness

"Ridge Regression"
## [1] "Ridge Regression"
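# alpha = 0 selects the ridge penalty; lambda is chosen by cross-validation
ridge.out <- cv.glmnet(x, y, alpha = 0)
coef(ridge.out)  # coefficients at the default s = "lambda.1se"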
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  2.418451e+00
## Family       1.158007e+00
## Health       1.530776e+00
## Freedom      9.983247e-01
## Trust        1.207948e+00
## Generosity   3.723445e-01
## Alcohol      1.564960e-02
## Population  -1.074559e-10
## Tobacco     -5.645452e-03

The Lasso Regression

"Lasso Regression"
## [1] "Lasso Regression"
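# alpha = 1 selects the lasso penalty; lambda is chosen by cross-validation
lasso.out <- cv.glmnet(x, y, alpha = 1)
coef(lasso.out)  # "." in the output marks coefficients shrunk exactly to zero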
## 9 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept)  2.065732886
## Family       1.476035508
## Health       1.997754757
## Freedom      0.733453871
## Trust        0.934505831
## Generosity   .          
## Alcohol      .          
## Population   .          
## Tobacco     -0.006076065
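The difference between the two outputs, where Ridge keeps every coefficient and Lasso sets some exactly to zero, can be illustrated in the simplified case of standardised, uncorrelated predictors. There the Lasso solution is a soft-thresholded version of the least-squares estimate, while Ridge only rescales it. The helper names and all numbers below are made up for illustration, and the penalties are assumed to be suitably scaled.

```r
# Lasso in the simplified setting: estimates within +/- lambda of zero become 0,
# larger ones are moved towards zero by lambda.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Ridge in the same setting: every estimate is rescaled, none becomes exactly 0.
ridge_shrink <- function(beta, lambda) {
  beta / (1 + lambda)
}

beta_ols <- c(a = 2.0, b = 0.8, c = 0.1, d = -0.05)  # made-up estimates
lambda   <- 0.2                                      # arbitrary penalty

lasso <- soft_threshold(beta_ols, lambda)
ridge <- ridge_shrink(beta_ols, lambda)
print(rbind(ols = beta_ols, lasso = lasso, ridge = ridge))
```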

PCA (Colour by region) + biplot (or PLS)

SOM

How does happiness change over time?

geography map (colour each country based on the percentage change over time (2015-2022))
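The percentage change per country that such a map would display could be computed along these lines. The data frame `over_time` and its scores are hypothetical; the column names follow the over-time dataset described earlier.

```r
# Toy long-format data (scores are made up): one row per country and year.
over_time <- data.frame(
  Country   = c("A", "A", "B", "B"),
  Year      = c(2015, 2022, 2015, 2022),
  Happiness = c(5.0, 5.5, 6.0, 5.4)
)

first <- over_time[over_time$Year == 2015, c("Country", "Happiness")]
last  <- over_time[over_time$Year == 2022, c("Country", "Happiness")]
change <- merge(first, last, by = "Country", suffixes = c("_2015", "_2022"))

# Percentage change relative to the 2015 score.
change$PctChange <- 100 * (change$Happiness_2022 - change$Happiness_2015) /
  change$Happiness_2015
print(change)
```

Countries missing in either year drop out of the inner join, so they would need separate handling on the map.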

What further influences happiness?

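library(ggplot2)
library(plotly)

# Happiness score per region as boxplots, with one jittered point per country
box <- ggplot(data_2018, aes(x = Region, y = Happiness, color = Region)) +
  geom_boxplot() +
  geom_jitter(aes(color = Country), size = 0.5) +
  ggtitle("Happiness Score for Regions and Countries") +
  coord_flip() +
  theme(legend.position = "none")
ggplotly(box)  # interactive version of the plot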

Future work